Attrition is the departure of employees from an organisation for any reason, voluntary or involuntary, including resignation, termination, death, or retirement. Employees are the backbone of any organisation, and their departure can cause substantial losses: the cost of recruiting and training replacements, the loss of institutional knowledge, and reduced productivity and morale among remaining staff.
Given an employee dataset, uncover the factors that lead to employee attrition and compare average monthly income by education and attrition.
Age: Employee's age in years.
Attrition: Whether the employee left the company (Yes/No).
BusinessTravel: Frequency of business travel (Non-Travel, Travel_Rarely, Travel_Frequently).
DailyRate: Employee's daily rate.
Department: Department in which the employee works.
DistanceFromHome: Distance from the employee's home to the workplace.
Education: Employee's level of education, rated 1-5.
EducationField: Employee's field of education (Life Sciences, Medical, Marketing, Technical Degree, Human Resources, Other).
EmployeeCount: Employee count recorded for each row (constant).
EmployeeNumber: Unique number assigned to each employee.
EnvironmentSatisfaction: Satisfaction with the work environment, rated 1-4.
Gender: Whether the employee is male or female.
HourlyRate: Employee's hourly rate.
JobInvolvement: Job involvement rating, 1-4.
JobLevel: Job level, rated 1-5.
JobRole: Employee's job role.
JobSatisfaction: Job satisfaction rating, 1-4.
MaritalStatus: Employee's marital status.
MonthlyIncome: Employee's monthly income.
MonthlyRate: Employee's monthly rate.
NumCompaniesWorked: Number of companies the employee has previously worked for.
Over18: Whether the employee is over 18 (Y/N).
OverTime: Whether the employee works overtime (Yes/No).
PercentSalaryHike: Percentage salary hike.
PerformanceRating: Employee performance rating.
RelationshipSatisfaction: Relationship satisfaction rating.
StandardHours: Standard working hours required by the organisation.
StockOptionLevel: Employee stock option level.
TotalWorkingYears: Total number of years the employee has worked.
TrainingTimesLastYear: Number of training sessions attended last year.
WorkLifeBalance: Work-life balance rating.
YearsAtCompany: Number of years the employee has been with the company.
YearsInCurrentRole: Number of years in the current role.
YearsSinceLastPromotion: Years since the employee's last promotion.
YearsWithCurrManager: Number of years with the current manager.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to assist with visualization of data
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn.metrics import (f1_score,accuracy_score,recall_score, precision_score)
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling
from sklearn.preprocessing import StandardScaler
# To do hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier,GradientBoostingClassifier, RandomForestClassifier,BaggingClassifier,)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, SVR
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Now let's read the employee attrition data
data = pd.read_csv("67714_HR_Employee_Attrition.csv")
# Print the first five rows of the dataset
data.head()
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
# Print the last five rows of the dataset
data.tail()
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1465 | 36 | No | Travel_Frequently | 884 | Research & Development | 23 | 2 | Medical | 1 | 2061 | 3 | Male | 41 | 4 | 2 | Laboratory Technician | 4 | Married | 2571 | 12290 | 4 | Y | No | 17 | 3 | 3 | 80 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 1466 | 39 | No | Travel_Rarely | 613 | Research & Development | 6 | 1 | Medical | 1 | 2062 | 4 | Male | 42 | 2 | 3 | Healthcare Representative | 1 | Married | 9991 | 21457 | 4 | Y | No | 15 | 3 | 1 | 80 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 1467 | 27 | No | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | 1 | 2064 | 2 | Male | 87 | 4 | 2 | Manufacturing Director | 2 | Married | 6142 | 5174 | 1 | Y | Yes | 20 | 4 | 2 | 80 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 1468 | 49 | No | Travel_Frequently | 1023 | Sales | 2 | 3 | Medical | 1 | 2065 | 4 | Male | 63 | 2 | 2 | Sales Executive | 2 | Married | 5390 | 13243 | 2 | Y | No | 14 | 3 | 4 | 80 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 1469 | 34 | No | Travel_Rarely | 628 | Research & Development | 8 | 3 | Medical | 1 | 2068 | 2 | Male | 82 | 4 | 2 | Laboratory Technician | 3 | Married | 4404 | 10228 | 2 | Y | No | 12 | 3 | 1 | 80 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
data.shape
(1470, 35)
print(f"The employee attrition dataset consists of {data.shape[0]} rows and {data.shape[1]} columns")
The employee attrition dataset consists of 1470 rows and 35 columns
# Code to display the dataset data types
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1470 entries, 0 to 1469 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 1470 non-null int64 1 Attrition 1470 non-null object 2 BusinessTravel 1470 non-null object 3 DailyRate 1470 non-null int64 4 Department 1470 non-null object 5 DistanceFromHome 1470 non-null int64 6 Education 1470 non-null int64 7 EducationField 1470 non-null object 8 EmployeeCount 1470 non-null int64 9 EmployeeNumber 1470 non-null int64 10 EnvironmentSatisfaction 1470 non-null int64 11 Gender 1470 non-null object 12 HourlyRate 1470 non-null int64 13 JobInvolvement 1470 non-null int64 14 JobLevel 1470 non-null int64 15 JobRole 1470 non-null object 16 JobSatisfaction 1470 non-null int64 17 MaritalStatus 1470 non-null object 18 MonthlyIncome 1470 non-null int64 19 MonthlyRate 1470 non-null int64 20 NumCompaniesWorked 1470 non-null int64 21 Over18 1470 non-null object 22 OverTime 1470 non-null object 23 PercentSalaryHike 1470 non-null int64 24 PerformanceRating 1470 non-null int64 25 RelationshipSatisfaction 1470 non-null int64 26 StandardHours 1470 non-null int64 27 StockOptionLevel 1470 non-null int64 28 TotalWorkingYears 1470 non-null int64 29 TrainingTimesLastYear 1470 non-null int64 30 WorkLifeBalance 1470 non-null int64 31 YearsAtCompany 1470 non-null int64 32 YearsInCurrentRole 1470 non-null int64 33 YearsSinceLastPromotion 1470 non-null int64 34 YearsWithCurrManager 1470 non-null int64 dtypes: int64(26), object(9) memory usage: 402.1+ KB
## Drop unnecessary columns (constant values or identifiers)
data = data.drop(columns=["EmployeeCount", "EmployeeNumber", "Over18", "StandardHours"])
data.shape
(1470, 31)
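These columns are safe to drop because `EmployeeCount`, `Over18`, and `StandardHours` are constant across all rows, and `EmployeeNumber` is just an identifier. A quick way to confirm such columns, sketched here on a small hypothetical frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical mini-frame mimicking the structure of the HR dataset
df = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],      # constant
    "Over18": ["Y", "Y", "Y"],       # constant
    "EmployeeNumber": [1, 2, 4],     # unique identifier
    "Age": [41, 49, 37],             # informative feature
})

# Columns with a single unique value carry no predictive signal
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # ['EmployeeCount', 'Over18']
```

Running the same check on `data` before dropping would flag exactly the constant columns removed above.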
# Code to print the statistical summary of the numerical columns in a dataset
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 1470.000 | 36.924 | 9.135 | 18.000 | 30.000 | 36.000 | 43.000 | 60.000 |
| DailyRate | 1470.000 | 802.486 | 403.509 | 102.000 | 465.000 | 802.000 | 1157.000 | 1499.000 |
| DistanceFromHome | 1470.000 | 9.193 | 8.107 | 1.000 | 2.000 | 7.000 | 14.000 | 29.000 |
| Education | 1470.000 | 2.913 | 1.024 | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 |
| EnvironmentSatisfaction | 1470.000 | 2.722 | 1.093 | 1.000 | 2.000 | 3.000 | 4.000 | 4.000 |
| HourlyRate | 1470.000 | 65.891 | 20.329 | 30.000 | 48.000 | 66.000 | 83.750 | 100.000 |
| JobInvolvement | 1470.000 | 2.730 | 0.712 | 1.000 | 2.000 | 3.000 | 3.000 | 4.000 |
| JobLevel | 1470.000 | 2.064 | 1.107 | 1.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| JobSatisfaction | 1470.000 | 2.729 | 1.103 | 1.000 | 2.000 | 3.000 | 4.000 | 4.000 |
| MonthlyIncome | 1470.000 | 6502.931 | 4707.957 | 1009.000 | 2911.000 | 4919.000 | 8379.000 | 19999.000 |
| MonthlyRate | 1470.000 | 14313.103 | 7117.786 | 2094.000 | 8047.000 | 14235.500 | 20461.500 | 26999.000 |
| NumCompaniesWorked | 1470.000 | 2.693 | 2.498 | 0.000 | 1.000 | 2.000 | 4.000 | 9.000 |
| PercentSalaryHike | 1470.000 | 15.210 | 3.660 | 11.000 | 12.000 | 14.000 | 18.000 | 25.000 |
| PerformanceRating | 1470.000 | 3.154 | 0.361 | 3.000 | 3.000 | 3.000 | 3.000 | 4.000 |
| RelationshipSatisfaction | 1470.000 | 2.712 | 1.081 | 1.000 | 2.000 | 3.000 | 4.000 | 4.000 |
| StockOptionLevel | 1470.000 | 0.794 | 0.852 | 0.000 | 0.000 | 1.000 | 1.000 | 3.000 |
| TotalWorkingYears | 1470.000 | 11.280 | 7.781 | 0.000 | 6.000 | 10.000 | 15.000 | 40.000 |
| TrainingTimesLastYear | 1470.000 | 2.799 | 1.289 | 0.000 | 2.000 | 3.000 | 3.000 | 6.000 |
| WorkLifeBalance | 1470.000 | 2.761 | 0.706 | 1.000 | 2.000 | 3.000 | 3.000 | 4.000 |
| YearsAtCompany | 1470.000 | 7.008 | 6.127 | 0.000 | 3.000 | 5.000 | 9.000 | 40.000 |
| YearsInCurrentRole | 1470.000 | 4.229 | 3.623 | 0.000 | 2.000 | 3.000 | 7.000 | 18.000 |
| YearsSinceLastPromotion | 1470.000 | 2.188 | 3.222 | 0.000 | 0.000 | 1.000 | 3.000 | 15.000 |
| YearsWithCurrManager | 1470.000 | 4.123 | 3.568 | 0.000 | 2.000 | 3.000 | 7.000 | 17.000 |
# Check for missing values in the dataset
data.isnull().sum()
Age 0 Attrition 0 BusinessTravel 0 DailyRate 0 Department 0 DistanceFromHome 0 Education 0 EducationField 0 EnvironmentSatisfaction 0 Gender 0 HourlyRate 0 JobInvolvement 0 JobLevel 0 JobRole 0 JobSatisfaction 0 MaritalStatus 0 MonthlyIncome 0 MonthlyRate 0 NumCompaniesWorked 0 OverTime 0 PercentSalaryHike 0 PerformanceRating 0 RelationshipSatisfaction 0 StockOptionLevel 0 TotalWorkingYears 0 TrainingTimesLastYear 0 WorkLifeBalance 0 YearsAtCompany 0 YearsInCurrentRole 0 YearsSinceLastPromotion 0 YearsWithCurrManager 0 dtype: int64
# Code to check for duplicates in the dataset
data.duplicated().sum()
0
# Code to make a copy of the original dataset
data_emp = data.copy()
data_emp.shape
(1470, 31)
# Select numerical columns from the dataset
numerical_columns = data_emp.select_dtypes("number").columns
### Function to display a histogram and a boxplot for a numerical column
def box_histplot(data,item):
plt.figure(figsize=(15,5)) # increase the size of the plot
plt.title(f"Histogram for {item}") # Give the graph a title
plt.xlabel(item) # change the label on the x-axis
plt.ylabel("frequency") # change the label on the y-axis
sns.histplot(data=data,x=item,kde=True); # histogram for numerical dataset
plt.axvline(data[item].mean(),color="black",linestyle="--")
plt.axvline(data[item].median(),color="red",linestyle="-")
plt.show()
plt.figure(figsize=(15,5)) # increase the size of the plot
plt.title(f"Boxplot for {item}") # Give the plot a suitable title
sns.boxplot(data=data,x=item,showmeans=True); # Boxplot for numerical dataset
plt.xlabel(item) # change the label on the x-axis
plt.show()
# Call the function to display the histogram and boxplot for each numerical column
for item in data_emp[numerical_columns]:
box_histplot(data_emp,item)
# Code to select categorical columns
categorical_data = data_emp.select_dtypes("object").columns
## Create a function to display the bar graphs for categorical data columns
def count_plot(data,item):
plt.figure(figsize=(5,4)) # increase the size of the plot
sns.countplot(data=data , x=item); # countplot for categorical columns
plt.title(f"Countplot for {item}") # add the title on the countplot
plt.ylabel("frequency") # add the label on the y-axis
plt.xticks(rotation=90)
plt.show()
# Print the countplot for attrition
count_plot(data_emp,"Attrition")
data_emp["Attrition"].value_counts(normalize=True)
No 0.839 Yes 0.161 Name: Attrition, dtype: float64
### Construct a pie chart for the attrition data
attrition = ["Yes", "No"]
# Fraction of employees in each attrition class
values = [(data_emp["Attrition"] == item).mean() for item in attrition]
# Create a pie chart to show the percentage for attrition
plt.figure(figsize=(4,5))
plt.pie(values,labels=attrition,autopct="%1.1f%%")
plt.show()
About 84% of the employees stayed with the company, while 16% left.
# Print the countplot for Business Travel
count_plot(data_emp,"BusinessTravel")
data_emp["BusinessTravel"].value_counts(normalize=True)
Travel_Rarely 0.710 Travel_Frequently 0.188 Non-Travel 0.102 Name: BusinessTravel, dtype: float64
# Create a pie chart to show the percentage for Business Travel
plt.figure(figsize=(4,5))
values = data_emp["BusinessTravel"].value_counts() /data_emp.shape[0]
plt.pie(values,labels=values.keys(),autopct="%1.1f%%")
plt.show()
About 71% of the employees travel rarely, 19% travel frequently, and 10% do not travel at all.
# Print the countplot for Department
count_plot(data_emp,"Department")
# Create a pie chart to show the percentage for Department
plt.figure(figsize=(4,5))
values = data_emp["Department"].value_counts() /data_emp.shape[0]
plt.pie(values,labels=values.keys(),autopct="%1.1f%%")
plt.show()
# Print the countplot for education field for employees
count_plot(data_emp,"EducationField")
# Create a pie chart to show the percentage for Educational field
plt.figure(figsize=(4,5))
values = data_emp["EducationField"].value_counts() /data_emp.shape[0]
plt.pie(values,labels=values.keys(),autopct="%1.1f%%")
plt.show()
# Print countplot for Gender of employees
count_plot(data_emp,"Gender")
# Create a pie chart to show the percentage for Gender
plt.figure(figsize=(4,5))
values = data_emp["Gender"].value_counts() /data_emp.shape[0]
plt.pie(values,labels=values.keys(),autopct="%1.1f%%")
plt.show()
60% of the employees in the company are males and 40% are females.
# Print the countplot for employee job roles
count_plot(data_emp,"JobRole")
# Create a pie chart to show the percentage for Job Role
plt.figure(figsize=(4,5))
values = data_emp["JobRole"].value_counts() /data_emp.shape[0]
plt.pie(values,labels= values.keys(),autopct="%1.1f%%")
plt.show()
About 22% of the employees are Sales Executives, 20% are Research Scientists, and only 4% work in Human Resources.
# Print the countplot for employee marital status
count_plot(data_emp,"MaritalStatus")
# Create a pie chart to show the percentage for marital status
plt.figure(figsize=(4,5))
values = data_emp["MaritalStatus"].value_counts() /data_emp.shape[0]
plt.pie(values,labels=values.keys(),autopct="%1.1f%%")
plt.show()
# Print the countplot for employee overtime status
count_plot(data_emp,"OverTime")
data_emp["OverTime"].value_counts(normalize=True)
No 0.717 Yes 0.283 Name: OverTime, dtype: float64
# Create a pie chart to show the percentage for overtime
plt.figure(figsize=(4,5))
values = data_emp["OverTime"].value_counts() /data_emp.shape[0]
plt.pie(values,labels=values.keys(),autopct="%1.1f%%")
plt.show()
72% of the employees do not work overtime, meaning they keep to the standard working hours required by the company.
This may imply that their work-life balance is better than that of employees who work overtime.
# Call the function to display the countplot for each categorical column
for item in data_emp[categorical_data]:
count_plot(data_emp,item)
## Function to display a boxplot of a numerical column grouped by a categorical column
def box_plot(data,item_1,categorical_item):
plt.figure(figsize=(15,7)) # Increase the size of the plot
sns.boxplot(data=data,x = item_1, y = categorical_item, palette = "Paired_r")
plt.title(f"The boxplot for {item_1} relative to {categorical_item}")
plt.xlabel(item_1)
plt.ylabel(categorical_item)
plt.show()
# Boxplot for monthly income relative to attrition
box_plot(data_emp,"MonthlyIncome","Attrition")
The monthly income of employees who stayed is right-skewed, with the mean greater than the median and several outliers beyond the upper whisker.
About 75% of the employees who stayed earn roughly 7,800 dollars or less per month.
Employees who left have a lower average monthly income than those who stayed.
This may imply that monthly income has an impact on employee attrition.
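The project brief also asks for a comparison of average monthly income by education and attrition; a `groupby` followed by `unstack` produces that table directly. A minimal sketch on a toy frame (on the real data the same call would be made with `data_emp` in place of `df`):

```python
import pandas as pd

# Toy frame standing in for the HR dataset (values are illustrative only)
df = pd.DataFrame({
    "Education": [1, 1, 2, 2],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "MonthlyIncome": [2000, 4000, 3000, 5000],
})

# Mean income for every Education x Attrition cell
income_table = (
    df.groupby(["Education", "Attrition"])["MonthlyIncome"]
      .mean()
      .unstack()
)
print(income_table)
```

The result is a table with education levels as rows and attrition status as columns, which can also be visualised with `sns.barplot(data=df, x="Education", y="MonthlyIncome", hue="Attrition")`.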
# Boxplot for monthly income relative to gender
box_plot(data_emp,"MonthlyIncome","Gender")
The monthly income distribution for both genders is right-skewed, with outliers above the upper whisker.
The average monthly income for female employees is slightly higher than for male employees.
About 75% of female employees earn roughly 8,750 dollars or less per month, an upper quartile somewhat above that of males.
This suggests that, on average, female employees in the organisation earn more than male employees.
# Boxplot for monthly income relative to jobroles
box_plot(data_emp,"MonthlyIncome","JobRole")
# Boxplot for monthly income relative to department.
box_plot(data_emp,"MonthlyIncome","Department")
On average, employees in Sales earn a higher monthly income than those in Research & Development and Human Resources.
# Print the boxplot of each numerical column relative to attrition
for variable in numerical_columns:
box_plot(data,variable,categorical_data[0])
## Function to display a countplot of one categorical column split by another (e.g., attrition)
def bivariate_plot(data,item_1,item_2):
plt.figure(figsize=(10,7))
sns.countplot(x=item_1,data=data,palette="YlGnBu",hue=item_2)
plt.ylabel("frequency")
plt.show()
bivariate_plot(data_emp,"Attrition","Gender")
round(data.groupby("Attrition")["Gender"].value_counts(normalize=True),3)*100
Attrition Gender
No Male 59.400
Female 40.600
Yes Male 63.300
Female 36.700
Name: Gender, dtype: float64
bivariate_plot(data_emp,"Gender","Attrition")
round(data.groupby("Gender")["Attrition"].value_counts(normalize=True),3)*100
Gender Attrition
Female No 85.200
Yes 14.800
Male No 83.000
Yes 17.000
Name: Attrition, dtype: float64
bivariate_plot(data_emp,"Department","Attrition")
bivariate_plot(data_emp,"EducationField","Attrition")
round(data.groupby("EducationField")["Attrition"].value_counts(normalize=True),3)*100
EducationField Attrition
Human Resources No 74.100
Yes 25.900
Life Sciences No 85.300
Yes 14.700
Marketing No 78.000
Yes 22.000
Medical No 86.400
Yes 13.600
Other No 86.600
Yes 13.400
Technical Degree No 75.800
Yes 24.200
Name: Attrition, dtype: float64
## Same bivariate countplot, with rotated x-tick labels for columns with many categories
def bivariate_plot1(data,item_1,item_2):
plt.figure(figsize=(10,7))
sns.countplot(x=item_1,data=data,palette="YlGnBu",hue=item_2)
plt.ylabel("frequency")
plt.xticks(rotation=90)
plt.show()
bivariate_plot1(data_emp,"JobRole","Attrition")
bivariate_plot(data_emp,"MaritalStatus","Attrition")
bivariate_plot(data_emp,"OverTime","Attrition")
bivariate_plot(data_emp,"Gender","OverTime")
## Construct a correlation matrix
plt.figure(figsize=(20, 20))
sns.heatmap(data_emp.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numerical_columns):
plt.subplot(4,7, i + 1)
plt.boxplot(data_emp[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
## Creating dummy variables
data_emp =pd.get_dummies(data_emp,drop_first=True)
data_emp.head()
| Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | Attrition_Yes | BusinessTravel_Travel_Frequently | BusinessTravel_Travel_Rarely | Department_Research & Development | Department_Sales | EducationField_Life Sciences | EducationField_Marketing | EducationField_Medical | EducationField_Other | EducationField_Technical Degree | Gender_Male | JobRole_Human Resources | JobRole_Laboratory Technician | JobRole_Manager | JobRole_Manufacturing Director | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | MaritalStatus_Married | MaritalStatus_Single | OverTime_Yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1102 | 1 | 2 | 2 | 94 | 3 | 2 | 4 | 5993 | 19479 | 8 | 11 | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 49 | 279 | 8 | 1 | 3 | 61 | 2 | 2 | 2 | 5130 | 24907 | 1 | 23 | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 37 | 1373 | 2 | 2 | 4 | 92 | 2 | 1 | 3 | 2090 | 2396 | 6 | 15 | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 33 | 1392 | 3 | 4 | 4 | 56 | 3 | 1 | 3 | 2909 | 23159 | 1 | 11 | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 27 | 591 | 2 | 1 | 1 | 40 | 3 | 1 | 2 | 3468 | 16632 | 9 | 12 | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
data_emp.shape
(1470, 45)
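`pd.get_dummies` with `drop_first=True` turns each categorical column into binary indicator columns and drops the first level of each, avoiding the redundant (perfectly collinear) column. A minimal sketch on a single toy column:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# One indicator column per category, minus the first level (here "Female"),
# which remains implied when all other indicators are 0
dummies = pd.get_dummies(df, drop_first=True)
print(list(dummies.columns))  # ['Gender_Male']
```

This is why the 31-column frame with 9 categorical columns expands to 45 columns rather than one column per category level.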
## Count the classes in the attrition column
data_emp["Attrition_Yes"].value_counts()
0 1233 1 237 Name: Attrition_Yes, dtype: int64
The attrition column is imbalanced: 1,233 employees stayed versus 237 who left.
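With an 84:16 split, a model can reach high accuracy by always predicting "No", which is one reason recall is tracked below. A common remedy is to weight classes inversely to their frequency; this numpy sketch reproduces the formula scikit-learn uses for `class_weight="balanced"`, applied to the class counts above:

```python
import numpy as np

counts = np.array([1233, 237])   # No / Yes counts from the dataset
n_samples, n_classes = counts.sum(), len(counts)

# balanced weight for class c: n_samples / (n_classes * count_c)
weights = n_samples / (n_classes * counts)
print(weights.round(3))  # minority class receives the larger weight
```

Passing `class_weight="balanced"` to classifiers that support it (e.g., `LogisticRegression`, `RandomForestClassifier`) applies exactly these weights during fitting.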
# Dividing train data into X and y
X = data_emp.drop(["Attrition_Yes"], axis=1)
y = data_emp["Attrition_Yes"]
X.shape
(1470, 44)
# Split the dataset into training and testing sets in a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30,random_state=42)
print(f"The training dataset consists of {X_train.shape[0]} rows and {X_train.shape[1]} columns")
The training dataset consists of 1029 rows and 44 columns
print(f"The testing dataset consists of {X_test.shape[0]} rows and {X_test.shape[1]} columns")
The testing dataset consists of 441 rows and 44 columns
# Scale the data using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
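The order of calls matters here: `fit_transform` learns the per-feature means and standard deviations from the training split, and `transform` reuses those same training statistics on the test split, so no information from the test set leaks into the scaling. The same logic written out in plain numpy, with made-up values:

```python
import numpy as np

X_tr = np.array([[1.0], [2.0], [3.0]])   # toy training feature
X_te = np.array([[4.0]])                 # toy test feature

# Statistics come from the training data only
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma        # test scaled with *training* stats
print(X_te_scaled)
```

Fitting the scaler on the full dataset before splitting would silently leak test-set statistics into training.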
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy using target and predicted values
recall = recall_score(target, pred) # to compute Recall using target and predicted values
precision = precision_score(target, pred) # to compute Precision using target and predicted values
f1 = f1_score(target, pred) # to compute F1-score using target and predicted values
# creating a dataframe of metrics
df_perf = pd.DataFrame( {"Accuracy": acc,"Recall": recall,"Precision": precision,"F1": f1},index=[0],)
return df_perf
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
models = {
"Logistic Regression": LogisticRegression(),
"K-Nearest Neighbors": KNeighborsClassifier(),
"Decision Tree": DecisionTreeClassifier(),
"Random Forest": RandomForestClassifier(),
"Bagging": BaggingClassifier(),
"Gradient Boosting": GradientBoostingClassifier(),
"Ada boost": AdaBoostClassifier(),
"Support Vector Machine": SVC(),
}
for name, model in models.items():
scores = cross_val_score(estimator=model, X=X_train,y=y_train,scoring="recall",cv=3)
print(f"{name} recall: {round(np.mean(scores),4)}")
Logistic Regression recall: 0.4265 K-Nearest Neighbors recall: 0.1418 Decision Tree recall: 0.3698 Random Forest recall: 0.1648 Bagging recall: 0.2329 Gradient Boosting recall: 0.3071 Ada boost recall: 0.4207 Support Vector Machine recall: 0.176
## Check the performance of each model on the training data
logistic_train_perf = model_performance_classification_sklearn(models["Logistic Regression"].fit(X_train,
y_train),X_train,y_train)
kneighbor_train_perf = model_performance_classification_sklearn(models["K-Nearest Neighbors"].fit(X_train,
y_train),X_train,y_train)
decisiontree_train_perf = model_performance_classification_sklearn(models["Decision Tree"].fit(X_train,
y_train),X_train,y_train)
randomf_train_perf = model_performance_classification_sklearn(models["Random Forest"].fit(X_train,
y_train),X_train,y_train)
bagging_train_perf = model_performance_classification_sklearn(models["Bagging"].fit(X_train,
y_train),X_train,y_train)
gradient_train_perf = model_performance_classification_sklearn(models["Gradient Boosting"].fit(X_train,
y_train),X_train,y_train)
adaboost_train_perf = model_performance_classification_sklearn(models["Ada boost"].fit(X_train,
y_train),X_train,y_train)
svm_train_perf = model_performance_classification_sklearn(models["Support Vector Machine"].fit(X_train,
y_train),X_train,y_train)
### Display the performance measures for each model
models_train_comp_df = pd.concat([logistic_train_perf.T, kneighbor_train_perf.T,decisiontree_train_perf.T,
randomf_train_perf.T, bagging_train_perf.T, gradient_train_perf.T, adaboost_train_perf.T, svm_train_perf.T],axis=1,)
models_train_comp_df.columns = [
"Logistic ","KNeighbor","DecisionTree","Random forest","Bagging","Gradient boost","Adaboost","SVM"]
print("Training performance measures comparison")
models_train_comp_df
Training performance measures comparison
| Logistic | KNeighbor | DecisionTree | Random forest | Bagging | Gradient boost | Adaboost | SVM | |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.892 | 0.861 | 1.000 | 1.000 | 0.984 | 0.954 | 0.907 | 0.913 |
| Recall | 0.517 | 0.250 | 1.000 | 1.000 | 0.915 | 0.739 | 0.557 | 0.489 |
| Precision | 0.778 | 0.800 | 1.000 | 1.000 | 0.994 | 0.992 | 0.845 | 1.000 |
| F1 | 0.621 | 0.381 | 1.000 | 1.000 | 0.953 | 0.847 | 0.671 | 0.656 |
Decision Tree and Random Forest score a perfect 1.000 on the training data, which signals overfitting rather than genuine skill; among the remaining models, Bagging and Gradient Boosting perform best, with high accuracy, recall, and F1 scores.
Because the F1 score penalises both false positives and false negatives, whereas recall penalises only false negatives, F1 is the better single measure of performance here.
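As a quick sanity check of that distinction, the metrics can be computed by hand from toy confusion counts (numbers chosen purely for illustration):

```python
# Toy confusion counts: true positives, false positives, false negatives
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # penalised by false positives
recall = tp / (tp + fn)      # penalised by false negatives
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))          # harmonic mean, pulled down by the weaker of the two
```

Because F1 is a harmonic mean, a model cannot score well on it by sacrificing precision for recall or vice versa.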
## Check the performance of each model on the testing data
# The models were already fitted on the training data above; here they are only evaluated on the unseen test set
logistic_test_perf = model_performance_classification_sklearn(models["Logistic Regression"], X_test, y_test)
kneighbor_test_perf = model_performance_classification_sklearn(models["K-Nearest Neighbors"], X_test, y_test)
decisiontree_test_perf = model_performance_classification_sklearn(models["Decision Tree"], X_test, y_test)
randomf_test_perf = model_performance_classification_sklearn(models["Random Forest"], X_test, y_test)
bagging_test_perf = model_performance_classification_sklearn(models["Bagging"], X_test, y_test)
gradient_test_perf = model_performance_classification_sklearn(models["Gradient Boosting"], X_test, y_test)
adaboost_test_perf = model_performance_classification_sklearn(models["Ada boost"], X_test, y_test)
svm_test_perf = model_performance_classification_sklearn(models["Support Vector Machine"], X_test, y_test)
### Display the performance measures for each model
models_test_comp_df = pd.concat([logistic_test_perf.T, kneighbor_test_perf.T,decisiontree_test_perf.T,
randomf_test_perf.T, bagging_test_perf.T, gradient_test_perf.T, adaboost_test_perf.T, svm_test_perf.T],axis=1,)
models_test_comp_df.columns = [
"Logistic ","KNeighbor","DecisionTree","Random forest","Bagging","Gradient boost","Adaboost","SVM"]
print("Testing performance measures comparison")
models_test_comp_df
Testing performance measures comparison
| Logistic | KNeighbor | DecisionTree | Random forest | Bagging | Gradient boost | Adaboost | SVM | |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.907 | 0.866 | 1.000 | 1.000 | 0.984 | 0.995 | 0.934 | 0.893 |
| Recall | 0.393 | 0.098 | 1.000 | 1.000 | 0.885 | 0.967 | 0.623 | 0.230 |
| Precision | 0.857 | 0.600 | 1.000 | 1.000 | 1.000 | 1.000 | 0.864 | 1.000 |
| F1 | 0.539 | 0.169 | 1.000 | 1.000 | 0.939 | 0.983 | 0.724 | 0.373 |
Gradient Boosting and Bagging again post the highest recall and F1 scores, with Gradient Boosting best overall (recall 0.967, F1 0.983).
Note, however, that test metrics are only meaningful for models fitted solely on the training data; near-perfect test scores like those for Decision Tree and Random Forest indicate the models were evaluated on data they had already seen.
%%time
# defining model
gradient_boost_tuned = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid= {"n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1],
"subsample":[0.5,0.7], "max_features":[0.5,0.7]}
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=gradient_boost_tuned, param_distributions=param_grid, scoring=scorer, n_iter=10, n_jobs = -1, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv=randomized_cv.fit(X_train,y_train)
# Set the clf to the best combination of parameters and fit the tuned model
gradient_boost_tuned=randomized_cv.best_estimator_
gradient_boost_tuned.fit(X_train,y_train)
CPU times: total: 1.17 s Wall time: 24.5 s
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5, n_estimators=125, random_state=1, subsample=0.5)
## Code to check the performance on the training data
gradient_train_perf =model_performance_classification_sklearn(gradient_boost_tuned , X_train, y_train)
## Code to check the performance on testing data
gradient_test_perf = model_performance_classification_sklearn(gradient_boost_tuned , X_test, y_test)
performance_on_gboost=pd.concat([gradient_train_perf.T,gradient_test_perf.T],axis=1)
performance_on_gboost.columns=["Performance on training data","Performance on testing dataset"]
print("Gradient boost performance measures after model tuning")
performance_on_gboost
Gradient boost performance measures after model tuning
| Performance on training data | Performance on testing dataset | |
|---|---|---|
| Accuracy | 0.984 | 0.866 |
| Recall | 0.938 | 0.328 |
| Precision | 0.971 | 0.526 |
| F1 | 0.954 | 0.404 |
After hyperparameter tuning, the recall and F1 scores on the test set declined by roughly 62% on average, and accuracy fell by about 118 basis points.
The wide gap between training and test performance suggests the tuned model still overfits the training data.
## Code to check the feature importance on the best model
feature_names = X.columns
importances = gradient_boost_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="blue", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Predict employee attrition for the test set
attrition_pred = gradient_boost_tuned.predict(X_test)
df_series =pd.Series(attrition_pred)
df_series.value_counts()
0 403 1 38 dtype: int64
## Construct a pie chart for the predicted attrition.
plt.figure(figsize=(4,5))
values = df_series.value_counts() /df_series.shape[0]
plt.pie(values,labels=values.keys(),autopct="%1.1f%%")
plt.show()
The employee attrition dataset consists of 1470 rows and 35 columns.
Several classification models were used to predict employee attrition, and the Gradient Boosting model was used to identify the features with the strongest impact on attrition.
To evaluate model performance, the dataset was split into two parts: 70% for training and 30% for testing.
According to the Gradient Boosting results, the ten most important features are monthly income, overtime, age, daily rate, years at the company, stock option level, monthly rate, total working years, number of companies worked, and distance from home; these are the main factors behind employees' decisions to leave.
In conclusion, I strongly suggest that the company pay more attention to its employees and work to improve their job satisfaction.
In particular, Human Resources employees deserve attention because they report very low job satisfaction.
The company should also ensure employees have enough time to rest and spend with their families.